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THE  MASKED  SAMPLE  COVARIANCE  ESTIMATOR: 

AN  ANALYSIS  VIA  THE  MATRIX  LAPLACE  TRANSFORM 

RICHARD  Y.  CHEN,  ALEX  GITTENS,  AND  JOEL  A.  TROPP 


Abstract.  Covariance  estimation  becomes  challenging  in  the  regime  where  the  number  p  of  vari¬ 
ables  outstrips  the  number  n  of  samples  available  to  construct  the  estimate.  One  way  to  circumvent 
this  problem  is  to  assume  that  the  covariance  matrix  is  nearly  sparse  and  to  focus  on  estimating 
only  the  significant  entries.  To  analyze  this  approach,  Levina  and  Vershynin  (2011)  introduce  a 
formalism  called  masked  covariance  estimation,  where  each  entry  of  the  sample  covariance  estimator 
is  reweighed  to  reflect  an  a  priori  assessment  of  its  importance. 

This  paper  provides  a  new  analysis  of  the  masked  sample  covariance  estimator  based  on  the 
matrix  Laplace  transform  method.  The  main  result  applies  to  general  subgaussian  distributions. 
Specialized  to  the  case  of  a  Gaussian  distribution,  the  theory  offers  qualitative  improvements  over 
earlier  work.  For  example,  the  new  results  show  that  n  =  0(Rlog 2 p)  samples  suffice  to  estimate 
a  banded  covariance  matrix  with  bandwidth  B  up  to  a  relative  spectral-norm  error,  in  contrast  to 
the  sample  complexity  n  =  0(Slog5p)  obtained  by  Levina  and  Vershynin. 


1.  Introduction 

A  fundamental  problem  in  multivariate  statistics  is  to  obtain  an  accurate  estimate  of  the  co- 
variance  matrix  of  a  multivariate  distribution  given  independent  samples  from  the  distribution. 
This  challenge  arises  whenever  we  need  to  understand  the  spread  of  the  data  and  its  marginals,  for 
example,  when  we  perform  regression  analysis  [Fre05]  or  principal  component  analysis  [Jol02] . 

In  the  classical  setting  where  the  number  of  samples  exceeds  the  number  of  variables,  the  behavior 
of  standard  covariance  estimators  is  textbook  material  [JW02,  Mui82,  MKB80].  The  random 
matrix  literature  also  contains  a  huge  amount  of  relevant  work;  we  refer  to  the  book  [BS10]  and 
the  survey  [Verll]  for  further  information. 

Modern  applications,  in  contrast,  often  involve  a  small  number  of  samples  and  a  large  number 
of  variables.  The  paucity  of  data  makes  it  impossible  to  obtain  an  accurate  estimate  of  a  gen¬ 
eral  covariance  matrix.  As  a  remedy,  we  must  frame  additional  assumptions  on  the  model  and 
develop  estimators  that  exploit  this  extra  structure.  Over  the  last  few  years,  a  number  of  papers, 
including  [FB07,  BL08b,  BL08a,  Kar08,  RLZ09,  CZZ10],  have  focused  on  the  situation  where  the 
covariance  matrix  is  sparse  or  nearly  so.  In  this  case,  we  imagine  that  we  could  limit  our  attention 
to  the  significant  entries  of  the  covariance  matrix  and  thereby  perform  more  accurate  estimation 
with  fewer  samples. 

This  paper  studies  a  particular  technique  for  the  sparse  covariance  problem  that  we  call  the 
masked  sample  covariance  estimator.  This  approach  uses  a  mask  matrix,  constructed  a  priori ,  to 
specify  the  importance  we  place  on  each  entry  of  the  covariance  matrix.  By  reweighting  the  sample 
covariance  estimate  using  a  mask,  we  can  reduce  the  influence  of  entires  that  we  cannot  estimate 
reliably.  For  instance,  if  the  covariance  matrix  is  approximated  by  a  banded  matrix,  one  sets 
the  entries  of  the  mask  to  zero  outside  of  the  band.  This  formalism  was  introduced  by  Levina  and 
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Vershynin  [LV11]  to  provide  a  unified  treatment  of  earlier  methods  for  sparse  covariance  estimation; 
we  refer  to  their  paper  for  a  more  detailed  discussion  of  prior  work. 

Levina  and  Vershynin  derive  an  elegant  bound  [LV11,  Thm.  2.1]  for  masked  covariance  estimation 
of  a  Gaussian  distribution.  In  this  work,  we  develop  a  completely  different  analysis  based  on  the 
matrix  Laplace  transform  method  [AW02,  Trollb].  The  advantage  of  this  approach  is  that  it  applies 
to  general  subgaussian  distributions  and  it  allows  us  to  obtain  more  refined  information  about  the 
quality  of  the  masked  sample  covariance  estimator. 

The  rest  of  this  Introduction  provides  an  overview  of  masked  covariance  estimation  and  its 
relationship  with  classical  covariance  estimation.  In  Section  1.6,  we  present  a  simplified  result  for 
the  behavior  of  the  masked  sample  covariance  estimator  applied  to  a  Gaussian  distribution,  and 
we  offer  a  concrete  comparison  with  the  results  of  Levina  and  Vershynin  [LV11,  Thm.  2.1].  More 
detailed  results  appear  in  Section  3. 

1.1.  Classical  Covariance  Estimation.  Consider  a  random  vector 

x  =  (x1,x2,...,xpy  eRP. 

Let  aq, . . .  xn  be  independent  random  vectors  that  follow  the  same  distribution  as  x.  For  simplicity, 
we  assume  that  the  distribution  is  known  to  have  zero  mean:  Ea:  =  0.  The  covariance  matrix  E 
is  a  p  x  p  matrix  that  tabulates  the  second-order  statistics  of  the  distribution: 

E  :=  E(xx*)  (1.1) 

where  *  denotes  the  transpose  operation.  The  classical  estimator  for  the  covariance  matrix  is  the 
sample  covariance  matrix ,  which  is  obtained  from  (1.1)  by  the  plug-in  principle: 

%n-.=  ~Y^1  ZLX*.  (L2) 

n  z — o=i 

The  sample  covariance  matrix  is  an  unbiased  estimator  of  the  covariance  matrix. 

Given  a  tolerance  e  6  (0,1),  we  can  study  how  many  samples  n  are  typically  required  to  provide 
an  estimate  with  relative  error  e  in  the  spectral  norm: 

E  ||En  —  E||  <  e  || E||  .  (1.3) 

This  type  of  spectral-nornr  error  bound  is  quite  powerful.  It  limits  the  magnitude  of  the  estimation 
error  for  each  entry  of  the  covariance  matrix;  it  provides  information  about  the  variance  of  each 
marginal  of  the  distribution  of  a:;  it  even  controls  the  error  in  estimating  the  eigenvalues  of  the 
covariance  using  the  eigenvalues  of  the  sample  covariance. 

Unfortunately,  an  error  bound  of  the  form  (1.3)  demands  a  lot  of  samples.  Suppose  that  the 
covariance  matrix  has  full  rank.  Then  the  number  of  samples  must  be  at  least  as  large  as  the 
number  of  variables  to  obtain  a  nontrivial  guarantee.  Indeed,  when  n  <  p,  the  sample  covariance 
does  not  even  have  full  rank,  so  the  spectral  norm  error  is  bounded  away  from  zero! 

Typical  positive  results  on  covariance  estimation  state  that  we  can  obtain  an  accurate  estimate 
for  the  covariance  matrix  when  the  number  of  samples  is  proportional  to  the  number  of  variables, 
provided  that  the  distribution  decays  fast  enough.  For  example,  assuming  that  x  follows  a  normal 
distribution, 

n  >  Cs~2p  ==>•  || E„  —  E  ||  <  e  || E ||  with  high  probability.  (1.4) 

We  use  the  analyst’s  convention  that  C  denotes  an  absolute  constant  whose  value  may  change  from 
appearance  to  appearance.  See  [Verll,  Thm.  57  et  seq.]  for  details  of  obtaining  the  bound  (1.4). 
The  work  of  Srivastava  and  Vershynin  [SV 11]  contains  the  most  recent  news  on  the  classical  co- 
variance  estimation  problem. 
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1.2.  Motivation  for  Masked  Covariance  Estimation.  In  the  regime  n  <C  p,  where  we  have 
very  few  samples,  we  can  never  hope  to  achieve  the  estimate  (1.3).  So  we  must  lower  our  standards. 
The  following  example  provides  some  insight  on  how  to  proceed. 

Example  1.1  (Simultaneous  Variance  Estimation).  Let  us  how  many  realizations  of  a  Gaussian 
random  vector  we  need  to  accurately  estimate  the  variance  of  each  component. 

First,  suppose  that  Z  is  a  zero-mean  normal  variable  with  variance  v.  Given  independent  copies 
Z\, . . . ,  Zn  of  the  random  variable  Z .  we  can  compute  the  sample  variance 


The  estimator  v  is  unbiased,  and,  up  to  scaling,  it  follows  a  chi-square  distribution,  so  the  probability 
of  error  satisfies 

P{|v-v|  >  8v}  <  2e-,rf2/4  forte  (0,1).  (1.5) 

For  a  clean  proof  of  this  type  of  inequality,  see  [Bar05,  Prop.  (2.2)]. 

Next,  suppose  that  the  random  vector  x  follows  a  zero-mean  normal  distribution  with  arbitrary 
covariance  S,  and  write  oy,-  for  the  (i,j)  entry  of  this  matrix.  When  we  use  the  sample  covariance 
to  estimate  each  of  the  p  diagonal  entries  of  S,  the  bound  (1.5)  implies  that 

P  jmaxj  |(Sn  -  S)j*|  >  (max,  a1A)  •  f  j  <  2 pe_ni2/4. 

We  conclude  that,  for  e  e  (0, 1), 

n  >  Ce_2log p  =>  max,  |(En  —  S)jj|  <  e  max,  an  with  high  probability.  (1.6) 

Since  max,  a  a  <  ||S||,  the  error  obtained  in  (1.6)  is  smaller  than  the  spectral-norm  error  in  (1.4). 

When  the  covariance  S  =  I,  it  can  be  shown  that  at  least  logp  samples  are  required  to  achieve 
the  bound  (1.6). 

Example  1.1  suggests  an  intriguing  possibility.  Although  we  need  at  least  p  samples  to  estimate 
the  entire  covariance  matrix,  roughly  logp  samples  suffice  to  estimate  the  diagonal.  It  turns  out 
that  this  phenomenon  is  generic:  If  we  estimate  only  a  small  portion  of  the  covariance  matrix,  then 
we  can  reduce  the  number  of  samples  dramatically.  This  observation  is  widely  applicable  because 
there  are  many  problems  where  we  do  not  need  to  know  all  of  the  second-order  statistics. 

Partitioning  Variables:  Suppose  that  we  divide  the  stock  market  into  disjoint  sectors,  and 
we  would  like  to  study  the  interactions  among  the  monthly  returns  for  stocks  within  each 
sector.  The  list  of  returns  for  all  the  stocks  can  be  treated  as  a  random  vector.  We  block 
the  covariance  matrix  of  this  random  vector  to  conform  with  the  market  sectors,  and  we 
estimate  only  the  entries  in  the  diagonal  blocks. 

Spatial  or  Temporal  Localization:  A  simple  random  model  for  grayscale  images  treats 
the  intensity  of  each  pixel  as  a  random  variable.  Nearby  pixels  tend  to  be  bright  or  dark 
together,  while  distant  pixels  are  usually  uncorrelated.  Thus,  we  might  limit  our  attention 
to  the  interactions  between  a  pixel  and  the  pixels  directly  adjacent  to  it.  This  model 
suggests  that  we  estimate  the  entries  of  the  covariance  that  lie  within  a  (generalized)  band 
about  the  diagonal. 

Graph  Structures:  Consider  a  stochastic  model  for  the  spread  of  an  epidemic  through  a 
social  network.  At  each  time  instant,  we  label  an  individual  with  a  random  variable  that 
measures  how  sick  he  is.  Since  transmission  only  occurs  along  links  in  the  network,  neighbors 
are  likely  to  be  sick  or  well  together.  As  a  result,  we  might  want  to  focus  on  estimating  the 
covariance  for  individuals  separated  by  one  degree.  In  this  case,  the  adjacency  matrix  of 
the  graph  determines  which  pairs  to  estimate. 
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1.3.  The  Mask  Matrix.  We  can  treat  all  the  examples  from  Section  1.2  using  a  formalism  that 
was  introduced  by  Levina  and  Vershynin  [LV 1 1] .  Let  M  be  a  fixed  p  x  p  symmetric  matrix  with 
real  entries,  which  we  call  the  mask  matrix.  The  basic  idea  is  to  construct  a  mask  that  guides  our 
attention  to  specific  parts  of  the  covariance  matrix. 

In  the  simplest  case,  the  mask  has  0-1  values  that  indicate  which  entries  of  the  covariance  we 
must  attend  to.  The  presence  of  a  unit  entry  m.ij  =  1  tells  us  to  estimate  the  interaction  between 
the  ith  and  jth  variable;  a  zero  entry  m.ij  =  0  means  that  we  abdicate  from  making  any  estimate 
of  this  interaction.  In  Example  1.1,  we  are  only  interested  in  the  diagonal  entries  of  the  covariance, 
so  we  are  using  the  mask  jag  =  I.  Here  are  some  other  basic  examples: 


M, 


group  •“ 


‘l  1 

'l  1 

‘i  ii' 

1  1 

1  1  1 

l  l 

1  1 

j  -^band  •  — 

1  1  1 

i  -^^graph  • 

l  l  l 

1  1 

1  1  1 

l  l 

1 

1  1 

l  l  l 

The  matrix  IWgroup  corresponds  to  the  case  where  we  partition  variables  into  three  subgroups,  and 
we  make  estimates  only  within  subgroups.  Masks  such  as  Mband  arise  from  banded  covariance 
estimation,  which  occurs  for  spatially  localized  random  fields.  The  mask  ALgraph  might  occur  when 
the  variables  exhibit  a  graphical  dependency  structure. 

In  more  complicated  situations,  we  can  allow  the  mask  to  take  arbitrary  nonnegative  values  and 
then  interpret  the  magnitude  of  each  entry  as  a  requirement  on  the  precision  of  the  estimate.  When 
m.ij  is  large,  we  must  study  the  interaction  between  the  ith  and  jth  variable  carefully.  When  mij 
is  small,  we  are  less  vigilant  about  how  well  we  estimate  the  (i,  j)  entry  of  the  covariance  matrix. 
An  example  of  a  mask  with  general  entries  is  the  Kac  matrix 


ALKac 


1  ip  p2  p3  p4 

t  2  S 

(f  1  (p  p  (f° 

p2  p  l  p  p2 

p3  p2  p  l  p 

p4  p3  p2  p  l 


where  p  G  (0, 1). 


The  mask  Mk&c  tapers  the  covariances  exponentially  depending  on  the  distance  \i  —  j\  between  the 
variables.  This  type  of  example  might  be  relevant  for  the  study  of  spatially  localized  processes. 

Most  of  the  regularization  techniques  for  sparse  covariance  estimation  studied  in  the  literature, 
such  as  [BL08b,  FB07,  CZZ10],  can  be  described  using  mask  matrices.  The  initial  works  focus 
on  specific  cases,  such  as  banded  masks  and  tapered  masks,  whereas  we  have  followed  Levina  and 
Vershynin  [LV  11]  by  allowing  an  arbitrary  symmetric  matrix  M.  We  refer  to  the  papers  cited  in 
this  paragraph  for  further  background  and  references. 


Remark  1.2.  Let  us  emphasize  that  the  entries  of  the  mask  can  take  both  positive  and  negative 
values,  but  it  is  harder  to  find  a  clear  interpretation  of  a  mask  that  has  negative  entries. 


1.4.  The  Masked  Sample  Covariance  Estimator.  Suppose  that  we  have  specified  asymmetric 
pxp  mask  M  with  real  entries.  The  masked  covariance  and  the  masked  sample  covariance  estimator 
are  the  two  matrices 

M  ©  £  and  M  0  Sn, 

where  the  symbol  ©  denotes  the  componentwise  (i.e. ,  Schur  or  Hadamard)  product.  The  goal  of 
this  work  is  to  study  the  error  incurred  when  we  estimate  the  masked  covariance  matrix  using  the 
masked  sample  covariance: 

II M  0  En  -  M  0  £ 


(1.7) 
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As  noted  by  Levina  and  Vershynin  [LV11,  Sec.  1],  control  on  the  error  (1.7)  also  delivers  information 
about  how  well  we  estimate  the  full  covariance  because 

|| M  ©  Sn  -  E||  <  \\M  ©  Sn  -  M  ©  E||  +  \\M  ©  S  -  S|| .  (1.8) 

The  first  term  in  (1.8)  reflects  the  variance  of  the  estimator  about  its  mean  value,  while  the  second 
term  represents  the  bias  in  the  estimate  owing  to  the  presence  of  the  mask.  It  is  important  to  select 
a  mask  AT  that  simultaneously  controls  both  the  variance  and  the  bias.  Understanding  the  variance 
term  requires  an  excursion  into  random  matrix  theory,  and  it  comprises  the  main  subject  of  this 
work.  Studying  the  bias  term  involves  only  a  deterministic  analysis,  which  should  be  undertaken 
with  a  specific  application  in  mind. 

When  the  error  (1.7)  is  small,  the  masked  sample  covariance  yields  accurate  estimates  for  each 
component  of  the  covariance  where  the  corresponding  entry  of  AT  is  large,  as  well  as  the  variance  of 
some  specially  chosen  marginals.  When  the  error  (1.8)  is  also  small,  the  masked  sample  covariance 
provides  additional  information  about  the  variance  of  all  marginals  of  the  distribution  of  a;,  as  well 
as  estimates  for  the  eigenvalues  of  the  covariance. 

1.5.  The  Complexity  of  a  Mask.  The  number  of  samples  we  need  to  control  (1.7)  depends  on 
“how  much”  of  the  covariance  matrix  we  are  attempting  to  estimate.  We  quantify  the  complexity 
of  the  mask  using  two  separate  metrics.  First,  define  the  square  of  the  maximum  column  norm  of 
the  mask  matrix: 

\\M\\{^2  :=  maxj  • 

Roughly,  the  parenthesis  reflects  the  number  of  interactions  we  want  to  estimate  that  involve  the 
variable  j,  and  the  maximum  computes  a  bound  over  all  p  variables.  The  second  metric  is  the 
spectral  norm  ||AT||  of  the  mask  matrix,  which  provides  a  more  global  view  of  the  complexity  of 
the  interactions  that  we  estimate. 

Some  examples  may  illuminate  how  these  metrics  reflect  the  properties  of  the  mask.  First, 
suppose  that  we  estimate  the  entire  covariance  matrix,  so  the  mask  is  the  matrix  of  ones: 

AT  =  matrix  of  ones  =$■  l|AT||1_>2=p  and  ||AT||=p. 

We  will  see  that  the  value  p  here  corresponds  with  the  factor  p  in  the  sample  complexity  bound  (1.4). 
Next,  consider  the  mask  that  arises  in  banded  covariance  estimation: 

AT  =  0~~1  matrix,  bandwidth  B  ==>•  ||AT||^2  <  B  and  ||AT||  <  B 

because  there  are  at  most  B  ones  in  each  row  and  column.  When  the  banded  mask  is  much 

less  complex  than  the  matrix  of  ones,  and  estimation  is  commensurately  easier.  Third,  assuming 
the  mask  is  a  Kac  matrix,  we  have 

„  „2  1  ....  1 

AT  =  Kac  matrix,  parameter  p  =>  AT  L  9  <  - ~  and  AT  <  - . 

1  —  pz  1  —  ip 

For  a  fixed  value  of  ip,  neither  quantity  depends  on  the  total  number  of  variables,  so  covariance 
estimation  with  this  mask  should  require  very  few  samples. 

Remark  1.3.  In  each  example  above,  the  two  metrics  take  very  similar  values,  but  this  coincidence 
does  not  always  occur.  Although  the  spectral  norm  dominates  the  maximum  column  norm,  the 
square  of  the  maximum  column  norm  can  be  substantially  larger  or  substantially  smaller  than  the 
spectral  norm.  We  have  omitted  examples  to  support  this  point  because  they  do  not  seem  to  arise 
naturally  in  the  setting  of  masked  covariance  estimation. 
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1.6.  Masked  Covariance  Estimation  for  Gaussian  Distributions.  This  paper  develops  a 
bound  for  the  estimation  error  (1.7)  when  the  random  vector  x  follows  a  subgaussian  distribution 
with  zero  mean.  For  illustrative  purposes,  this  section  focuses  on  the  simpler  case  where  the  random 
vector  has  a  normal  distribution.  The  general  results  appear  in  Section  3. 

Theorem  1.4  (Masked  Covariance  Estimation  for  Gaussian  Distributions).  Fix  a  pxp  symmetric 
mask  matrix  M .  Suppose  that  x  is  a  Gaussian  random  vector  in  with  mean  zero.  Define  the 
covariance  matrix  X  and  the  sample  covariance  matrix  Xn  as  in  (1.1)  and  (1.2).  Then  the  expected 
estimation  error  satisfies 

M  (1^2  log(6p)  \  1 1 M 1 1  log2(6np) 

n  J  n 

Theorem  1.4  is  a  simplified  version  of  Corollary  3.3.  The  reader  is  encouraged  to  examine  the 
full  result,  which  includes  several  substantial  refinements. 

Remark  1.5.  In  the  actual  practice  of  covariance  estimation,  we  center  each  sample  empirically  by 
subtracting  the  sample  mean  x  =  n-1  xi-  The  sample  covariance  (1.2)  is  computed  using 
the  centered  samples  x,  =  Xi  —  x  instead  of  the  original  samples  xt.  The  theory  in  this  paper  can 
be  extended  to  cover  the  masked  covariance  estimator  formed  with  centered  samples;  see  [LV11, 
Rem.  4]  for  the  details  of  the  argument. 

1.6.1.  Sample  Complexity  Bound.  Theorem  1.4  allows  us  to  develop  conditions  on  the  number  n 
of  samples  that  we  need  to  control  the  estimation  error  with  high  probability.  Markov’s  inequality 
can  be  used  to  convert  (1.9)  into  an  error  bound  that  holds  in  probability.  For  example,  with 
probability  at  least  99%, 

Af||j^2log p\  1 1 M 1 1  log2 (np) 

n  )  n 

For  stronger  exponential  error  bounds,  we  refer  to  Corollary  3.3.  To  obtain  the  sample  complexity, 
assume  that  n  <  p,  and  let  e  £  (0, 1).  Then  (1.10)  yields  the  statement 

n>  C  e~2  ||M||2^2  logp  +  e_1  \\M\\  log2  p  =>  ||  M  0  S„  —  M  ©  E||  <  e  ||E||  (1.11) 

with  probability  at  least  99%. 

1.6.2.  Is  this  Sample  Complexity  Bound  Optimal?  Levina  and  Vershynin  show  that  the  sample 
complexity  of  masked  covariance  estimation  must  exhibit  a  logarithmic  dependence  on  the  number 
p  of  variables  [LV11,  Rem.  3].  They  also  argue  that  there  should  be  a  linear  dependence  on  the 
maximum  number  of  interactions  that  involve  a  single  variable  [LV11,  Eqn.  (1.4)  et  seq.];  this  term 
appears  in  (1.11)  in  the  guise  of  ||.M'||1_>2.  As  a  consequence  of  these  observations,  it  seems  plausible 
that  the  first  summand  in  the  sample  bound  (1.11)  has  the  optimal  form.  On  the  other  hand,  we 
believe  that  the  factor  log2  p  in  the  second  summand  could  probably  be  reduced  to  log  p. 

The  discussion  in  Example  1.1  suggests  that  it  may  be  possible  to  improve  the  dependence  of 
the  sample  complexity  bound  (1.11)  on  the  spectral  norm  ||S||  of  the  covariance.  Indeed,  we  have 
obtained  a  refinement  of  this  type.  See  Corollary  3.3  for  details. 

1.6.3.  Application  Example.  Consider  the  banded  covariance  estimation  problem,  with  the  mask 

M  =  0-1  matrix  with  bandwidth  B. 

See  the  matrix  Afband  displayed  on  page  4  for  an  instance  with  B  =  3  and  p  =  5.  The  sample 
complexity  bound  (1.11)  and  the  norm  calculations  from  Section  1.5  demonstrate  that 

n  >  C  [e~2B logp  +  £~1Blog2  p] 


E||.  (1.10) 


(1.12) 
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is  sufficient  to  provide  a  relative  estimation  error  e  in  spectral  norm  with  99%  probability.  For 
comparison,  recall  the  sufficient  condition  (1.4)  that  the  sample  complexity  for  estimating  the 
entire  covariance  with  relative  error  e  satisfies 

n>  C  e~2p. 

When  the  bandwidth  is  much  smaller  than  the  number  of  variables  ( B  <p),  the  masked  covariance 
estimator  outperforms  the  classical  covariance  estimator.  On  the  other  hand,  when  the  bandwidth 
is  comparable  with  the  number  of  variables,  the  analysis  of  the  masked  covariance  estimator  gives 
a  sample  complexity  bound  (1.12)  that  is  worse  by  a  polylogarithmic  factor. 

We  remark  that,  when  e  is  constant,  the  second  summand  in  (1.12)  always  dominates  the  first 
as  p  — >  oo.  On  the  other  hand,  the  first  summand  is  larger  when  e  <  log^1^.  In  other  words,  the 
excess  logarithm  in  the  second  term  of  (1.12)  does  not  have  an  impact  on  the  sample  complexity 
when  we  are  seeking  highly  accurate  covariance  estimates. 


1.6.4.  Comparison  with  Bounds  of  Levina  and  Vershynin.  Theorem  1.4  should  be  compared  with 
the  main  result  of  Levina  and  Vershynin  [LV11,  Thm.  2.1],  which  states  that 

M\\1^.2log5/2p  |  ||M||log3p 
y/n  n 


E  M  0  -  M  ©  S  <  C 


The  associated  sample  complexity  bound  is 


n  >  C 


M\\l^2  log5  P 


£  1  \\M\\  log 3 p 


(1.13) 


Our  sample  complexity  bound  (1.11)  has  exactly  the  same  structure  as  (1.13),  but  we  have  managed 
to  remove  a  moderate  number  of  logarithms. 

We  do  not  feel  that  chopping  down  logs  is  an  interesting  pursuit  per  se.  Instead,  the  value 
of  this  work  stems  from  the  fact  that  we  have  applied  an  argument  that  is  completely  different 
from  previous  work  on  masked  covariance  estimation.  Our  approach  provides  some  qualitative 
refinements  over  Levina  and  Vershynin’s  bound  in  the  Gaussian  setting  (Corollary  3.3),  and  it  also 
extends  to  the  general  subgaussian  distributions  (Theorem  3.2). 


1.6.5.  Proof  Techniques.  The  argument  in  this  paper  is  based  on  a  recent  set  of  ideas,  collectively 
known  as  the  matrix  Laplace  transform  method.  This  approach  can  be  regarded  as  a  generalization 
of  the  classical  technique,  attributed  to  Bernstein,  that  develops  probability  inequalities  for  a 
random  variable  in  terms  of  bounds  for  its  cumulant  generating  function.  Tropp  [Trollb],  building 
on  work  of  Ahlswede  and  Winter  [AW02],  demonstrates  that  the  scalar  approach  admits  a  tight 
analogy  in  the  matrix  setting.  See  Section  2.4  for  an  overview  of  this  technique. 

The  matrix  Laplace  transform  method  is  particularly  well  suited  for  studying  sums  of  independent 
random  matrices.  To  apply  these  techniques,  we  express  the  error  as  a  sum  of  i.i.d.  random  matrices, 
each  with  zero  mean: 

M  0  Y]n  —  M  0  £  =  —  V  M  0  (x1:x*  —  Kxx*). 

n  r-^  i— l 

The  main  challenge  is  to  study  the  matrix  cumulant  generating  function  of  each  summand: 

logEexp  (8M  0  (xiX*  —  Eccx*))  for  0  >  0.  (1.14) 

The  key  technical  result  of  this  paper  is  a  semidefinite  upper  bound  for  the  matrix  cgf  (1.14). 
This  estimate  requires  a  number  of  substantial  new  ideas,  including  a  symmetrization  argument, 
a  careful  analysis  of  the  variance  of  the  random  matrix  in  the  exponent  of  (1.14),  and  a  delicate 
truncation  bound. 
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1.7.  Organization  of  the  paper.  The  rest  of  the  paper  is  organized  as  follows.  Section  2  in¬ 
troduces  our  notation  and  some  preliminaries.  Section  3  presents  the  main  result  for  zero-mean 
subgaussian  distributions,  together  with  its  proof  and  the  proof  of  Theorem  1.4.  In  Section  4,  we 
deal  with  the  technical  challenge  of  estimating  the  matrix  cumulant  generating  function  (1.14). 


2.  Preliminaries 

This  section  sets  out  the  background  material  we  require  for  the  proof.  The  argument  depends  on 
a  very  recent  set  of  ideas,  collectively  known  as  the  matrix  Laplace  transform  method.  We  introduce 
the  main  results  from  this  theory  in  Section  2.4,  and  we  provide  references  to  the  primary  sources. 
The  rest  of  the  material  here  is  more  or  less  standard.  Section  2.1  states  our  notational  conventions, 
Section  2.2  describes  some  basic  properties  of  the  Schur  product,  and  Section  2.3  includes  key  facts 
about  subgaussian  random  variables. 

2.1.  Notation  and  Conventions.  In  this  paper,  we  work  exclusively  with  real  numbers.  Plain 
italic  letters  always  refer  to  scalars.  Bold  italic  lowercase  letters,  such  as  a,  refer  to  column  vectors. 
Bold  italic  uppercase  letters,  such  as  A,  denote  matrices.  All  matrices  in  this  work  are  square; 
the  dimensions  are  determined  by  context.  We  write  0  for  the  zero  matrix  and  I  for  the  identity 
matrix.  The  matrix  unit  E ij  has  a  unit  entry  in  the  (i.  j)  position  and  zeros  elsewhere. 

The  symbol  *  denotes  the  transpose  operation  on  vectors  and  matrices.  We  use  the  term  self- 
adjoint  to  refer  to  a  matrix  that  satisfies  A  =  A*  to  avoid  confusion  between  symmetric  matrices 
and  symmetric  random  variables.  Curly  inequalities  refer  to  the  positive-semidefinite  partial  order¬ 
ing  on  self-adjoint  matrices:  A  =4  B  if  and  only  if  B  —  A  is  positive  semidefinite. 

The  function  diag(-)  maps  a  vector  a  to  a  matrix  whose  diagonal  entries  correspond  with  the 
entries  of  a.  We  write  tr(-)  for  the  trace  of  a  matrix.  The  symbol  ©  denotes  the  componentwise 
(i.e.,  Schur  or  Hadamard)  product  of  two  matrices. 

We  write  ||  ■  ||  for  both  the  1 2  vector  norm  and  the  associated  operator  norm,  which  is  usually  called 
the  spectral  norm.  The  norm  ||-||  returns  the  absolute  maximum  entry  of  a  vector.  For  clarity,  we 
use  a  separate  notation  ||-||max  for  the  absolute  maximum  entry  of  a  matrix.  The  maximum  column 
norm  ||-||1_>2  is  defined  as 

ii^iii-2  :=  maxf  m2) 7  • 

The  notation  reflects  the  fact  that  this  is  the  natural  norm  for  linear  maps  from  into  i 2 ■ 

We  reserve  the  symbol  e  for  a  Rademacher  random  variable,  which  takes  the  two  values  ±1  with 
equal  probability.  We  also  assume  that  all  random  variables  are  sufficiently  regular  that  we  are 
justified  in  computing  expectations,  interchanging  limits,  and  so  forth. 

2.2.  Facts  about  the  Schur  Product.  The  proof  depends  on  some  basic  properties  of  Schur 
products.  The  first  result  is  a  simple  but  useful  algebraic  identity.  For  each  square  matrix  A  and 
each  conforming  vector  x, 

A  ©  xx*  =  diag(*)Adiag(a;).  (2.1) 

The  second  result  states  that  the  Schur  product  with  a  positive-semidefinite  matrix  is  order  pre¬ 
serving.  That  is,  for  a  fixed  positive-semidefinite  matrix  A, 

B\  ©  B2  implies  A  0  Si  =<!  A  ©  B2.  (2-2) 

This  property  follows  from  Schur’s  theorem  [HJ94,  Thm.  7.5.3],  which  demonstrates  that  the  Schur 
product  of  two  positive-semidefinite  matrices  remains  positive  semidefinite. 
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2.3.  Subgaussian  Random  Variables.  There  are  several  different  ways  to  formalize  the  concept 
of  a  random  variable  that  decays  faster  than  a  Gaussian  random  variable  [Verll],  For  the  purposes 
of  this  paper,  the  following  definition  is  most  convenient. 

Definition  2.1  (Subgaussian  random  variable).  A  random  variable  X  is  subgaussian  if  there  exists 
a  positive  constant  K  such  that 

P{|X|  >  t}  <  2e_f2/A"  for  all  t  >  0. 

The  subgaussian  coefficient  k(X)  is  defined  to  be  the  infimal  K  for  which  this  inequality  holds. 

We  can  bound  all  the  moments  of  a  subgaussian  random  variable  X  in  terms  of  its  subgaussian 
coefficient: 

roc  roc 

E\X\q  =  qt^F^Xl  >  t}  dt  <  /  qt*-1 -2e~t2/KiX)2  dt  =  2n{X)qr{q/2  +  l). 

Jo  Jo 

In  particular,  the  raw  fourth  moment  of  X  satisfies 

E\X\4  <Ak{X)4.  (2.3) 

2.4.  The  Matrix  Laplace  Transform  Method.  In  classical  probability,  the  Laplace  transform 
method  is  a  powerful  tool  for  obtaining  tail  bounds  for  a  sum  of  independent  random  variables. 
In  their  influential  paper  [AW02],  Ahlswede  and  Winter  describe  a  generalization  of  the  Laplace 
transform  method  that  applies  to  a  sum  of  independent  random  matrices.  Subsequent  papers 
by  Oliveira  [OlilOa,  OlilOb],  by  Tropp  [Trollb,  Trolla],  and  by  Hsu  et  al.  [HKZ11]  all  contain 
substantial  refinements  and  extensions  of  the  original  idea.  Altogether,  these  tools  are  easy  to  use, 
remarkably  effective,  and  widely  applicable. 

In  analogy  with  the  scalar  case,  we  study  large  deviations  using  a  matrix  version  of  the  moment 
generating  function  (mgf)  and  the  cumulant  generating  function  (cgf).  Let  Z  be  a  self-adjoint  ran¬ 
dom  matrix.  Using  the  matrix  exponential,  we  define  the  matrix  mgf  and  matrix  cgf,  respectively, 
to  be 

Mz{9)  :  =  Eeez  and  5Z(0)  :=  logEe0Z  for  9  €  R. 

Note  that  these  expectations  may  not  exist  for  all  values  of  9.  The  matrix  cgf  can  be  interpreted  as 
an  exponential  mean ,  an  average  that  emphasizes  large  deviations  of  the  spectrum  with  the  same 
sign  as  the  parameter  9. 

The  matrix  mgf  contains  valuable  information  about  the  behavior  of  the  maximum  eigenvalue 
of  a  symmetric  random  matrix.  The  following  result  is  a  matrix  analog  of  the  classical  approach  to 
large  deviations,  which  is  attributed  to  Bernstein. 

Proposition  2.2  (Matrix  Laplace  transform  bound).  Let  Z  be  a  random,  self-adjoint  matrix.  For 
each  i6l, 

P{Amax(^)  >t}<  inf  [e~et  •  Etre0Z}  .  (2.4) 

In  this  form,  Proposition  2.2  is  due  to  Oliveira  [OlilOb,  Sec.  3],  but  the  main  idea  goes  back  to  the 
paper  [AW02]  of  Ahlswede  and  Winter.  See  [Trollb,  Prop.  3.1]  for  a  succinct  proof. 

In  our  application,  the  random  matrix  Z  can  be  expressed  as  a  sum  of  i.i.d.  zero-mean  ran¬ 
dom,  self-adjoint  matrices.  The  argument  relies  on  a  symmetrization  procedure,  which  introduces 
additional  randomness  into  the  series. 

Proposition  2.3  (Symmetrization  bound).  Consider  a  sequence  of  independent,  ran¬ 

dom,  self-adjoint  matrices.  For  each  9  £  M, 

E  tr  exp  (E”_  i  9(Yi  -EW))  <  Etrexp  2e£iYi)  » 

where  {£*}  are  independent  Rademacher  random  variables  that  are  also  independent  from  {Yj}. 
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The  proof  of  Proposition  2.3  is  essentially  identical  with  the  proof  of  Lemma  7.6  in  [Trollb],  so  we 
omit  the  argument. 

The  matrix  Laplace  transform  method  derives  its  power  from  a  deep  technical  result  that  allows 
us  to  bound  the  mgf  of  a  sum  of  independent  random  matrices  in  terms  of  the  cgfs  of  the  summands. 
We  state  a  simplified  version  of  this  fact  that  suits  our  needs. 


Proposition  2.4  (Subadditivity  of  cgfs).  Let  Y  be  a  random,  self-adjoint  matrix.  Consider  a 
finite  sequence  of  independent  copies  of  Y .  For  each  0  6  M, 

Etrexp  l  dY^j  <  tr  exp  (n  log  E  eeY  j  . 

Proposition  2.4  is  due  to  Tropp  [Trollb,  Lem.  3.4],  The  main  ingredient  in  the  proof  is  a  celebrated 
concavity  theorem  established  by  Lieb  [Lie73,  Thm.  6]. 

We  use  these  techniques  to  develop  a  matrix  Bernstein  inequality  that  is  adapted  for  partial 
covariance  estimation.  The  final  ingredient  in  our  argument  is  a  matrix  mgf  bound  that  parallels 
the  classical  mgf  bound  underlying  Bernstein’s  inequality. 


Proposition  2.5  (Bernstein  matrix  mgf  bound).  Let  Y  be  a  random,  self-adjoint  matrix  that 
satisfies 

E"P  =  0  and  Amax(P)  <  R  almost  surely. 


When  6  e  (O,/?-1), 


Ee®v  ^  1  + 


9 2 

2(1  -  OR) 


E(r2). 


Proposition  2.5  follows  immediately  from  [Trollb,  Lem.  6.7]  and  the  classical  inequality 

e0R  -OR -l  02 


R2 


< 


2(1  -  OR) 


valid  for  9  E  (0,  R 


-L 


We  can  verify  this  bound  by  comparing  derivatives.  The  constants  in  this  inequality  can  be  im¬ 
proved,  but  we  have  chosen  the  version  here  to  streamline  other  aspects  of  the  argument. 


3.  Masked  Covariance  Estimation  for  a  Subgaussian  Distribution 

In  this  section,  we  state  and  prove  our  main  error  estimates  for  masked  covariance  estimation. 
Section  3.1  defines  two  concentration  parameters  that  measure  the  spread  of  the  distribution.  We 
present  the  main  theorem  for  subgaussian  distributions  in  Section  3.2,  and  we  specialize  to  Gaussian 
distributions  in  Section  3.3.  Section  3.4  shows  how  to  derive  the  result  for  Gaussian  matrices  from 
the  main  theorem.  Finally,  we  establish  the  main  result  in  Section  3.5. 

3.1.  Concentration  Parameters.  The  effectiveness  of  the  masked  sample  covariance  estimator 
depends  on  the  concentration  properties  of  the  distribution  of  x.  Let  us  introduce  two  quantities 
that  measure  different  facets  of  the  variation  of  the  random  vector. 

The  subgaussian  coefficient  k(x)  of  the  distribution  is  defined  to  be  the  maximum  subgaussian 
coefficient  of  a  single  component  of  the  vector: 

k(x)  :=  max*  n{Xf).  (3.1) 

In  other  words,  we  assume  that  each  component  of  the  distribution  exhibit  subgaussian  decay  with 
variance  controlled  by  n(x)2. 

We  do  not  need  every  marginal  of  the  distribution  to  be  subgaussian  with  controlled  variance, 
but  we  do  require  some  information  on  the  spread  of  the  distribution  in  other  directions.  Define 
the  uniform  fourth  moment  v{x)  by  the  formula 

u(x)  :=  sup  (E  \u*x\A)1^. 

IMI=i 


(3.2) 
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The  uniform  fourth  moment  measures  how  much  the  worst  marginal  varies. 

Note  that  both  k(x)  and  v(x)  have  the  same  homogeneity  as  the  random  vector  x.  (This 
property  is  sometimes  expressed  by  saying  that  the  quantities  have  the  same  dimension,  the  same 
units,  or  the  same  scaling.)  As  a  consequence,  n2(x)  and  v2(x)  have  the  same  homogeneity  as  the 
covariance  matrix  XI. 

In  the  sequel,  we  abbreviate  k  :=  n(x)  and  v  :=  v{x)  whenever  the  distribution  of  the  random 
vector  x  is  clear. 


Remark  3.1.  For  Gaussian  distributions,  the  uniform  fourth  moment  v  always  dominates  the  sub- 
gaussian  coefficient  k.  In  the  worst  case,  v  can  be  much  larger  than  k.  Indeed,  suppose  that  A 
is  a  standard  normal  random  variable,  and  consider  the  random  vector  x  =  (A,  A, . . . ,  A)*  £  Mp. 
Although  the  subgaussian  coefficient  k{x)  =  \/2,  the  directional  fourth  moment  v(x)  =  121//4v/p. 

For  other  kinds  of  distributions,  the  subgaussian  coefficient  k  may  be  substantially  larger  than 
the  uniform  fourth  moment  v.  Examples  of  this  phenomenon  already  emerge  in  the  univariate  case. 


3.2.  Main  Result  for  Masked  Covariance  Estimation.  The  following  theorem  provides  de¬ 
tailed  information  about  the  expectation  and  tail  behavior  of  the  error  in  the  masked  sample 
covariance  estimator  for  a  zero-mean  subgaussian  distribution. 


Theorem  3.2  (Masked  Covariance  Estimation  for  Subgaussian  Distributions).  Fix  a  p  x  p  sym¬ 
metric  mask  matrix  M .  Suppose  that  x  is  a  subgaussian  random  vector  in  with  mean  zero. 
Define  the  covariance  matrix  E  and  the  sample  covariance  matrix  En  as  in  (1.1)  and  (1.2).  Then 
the  expected  estimation  error  satisfies 


E  M0E„  — M0S  < 


16k2 fi2  ||M||^2log(2ep) 


n 


1/2 


+ 


4 k2  ||  M ||  log2(2enp) 


n 


Furthermore,  for  each  t  >  0,  the  estimation  error  satisfies  the  tail  bound 

-nt2/  2 


M  0  En  -  AT©  E  >t 


|  <  2  ep 


•  exp 


8k2u2  ||M||^2  +  4k2  \\M\\  log(4 np)  ■  t 


(3.3) 


(3.4) 


The  subgaussian  coefficient  k  and  the  uniform  fourth  moment  v  are  defined  in  (3.1)  and  (3.2). 


The  proof  of  Theorem  3.2  appears  in  Section  3.5.  We  can  extend  this  result  to  the  case  where 
we  center  the  observations  using  the  sample  mean  before  computing  the  sample  covariance;  the 
argument  is  identical  with  the  one  described  by  Levina  and  Vershynin  [LV11,  Rem.  4]  for  the 
Gaussian  case. 


3.2.1.  Interpretation  and  Consequences.  Let  us  take  a  moment  to  discuss  Theorem  3.2.  First,  we 
note  that  the  error  in  the  masked  sample  covariance  estimator  can  be  expressed  as 

M0S„-M0E  =  -  V"  M  Q(xiX* -Exx*),  (3.5) 

n  /—'i= i 

using  the  definitions  (1.1)  and  (1.2)  of  the  covariance  and  sample  covariance.  For  each  i,  the  paren¬ 
thesis  in  (3.5)  has  subexponential  tails  because  the  random  vector  Xi  is  subgaussian.  Therefore, 
the  formula  (3.5)  expresses  the  error  as  an  average  of  subexponential  random  variables. 

Consequently,  we  expect  the  estimation  error  to  obey  a  probability  inequality  just  like  (3.4).  For 
moderate  values  of  t,  the  error  (3.4)  exhibits  subgaussian  decay,  an  intimation  of  the  normal  profile 
that  emerges  when  the  number  of  samples  tends  to  infinity.  For  large  values  of  t,  the  error  has 
subexponential  decay,  owing  to  the  heavier  tails  of  the  summands  in  (3.5).  Likewise,  the  two  terms 
in  the  expected  error  bound  (3.3)  correspond  with  the  two  regimes  in  the  tail  bound.  The  first 
term  refiects  the  subgaussian  decay,  while  the  second  term  comes  from  the  subexponential  decay. 
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The  scale  for  subgaussian  decay  is  controlled  by  a  measure  of  the  variance  a2  of  each  summand: 

cr2  =  8K2n2  ||AT||2^2  . 

We  see  that  moderate  deviations  depend  on  the  local  properties  of  the  mask,  as  encapsulated 
in  ||AT||2^2.  The  appearance  of  the  subgaussian  coefficient  n  in  o2  refiects  the  variance  of  each 
component  of  the  random  vector.  The  presence  of  the  uniform  fourth  moment  v  shows  that  there 
is  also  a  role  for  the  spread  of  the  random  vector  in  every  direction. 

The  scale  for  subexponential  decay  is  controlled  by  a  second  quantity, 

R  =  4k2  1 1  AT 1 1  log(4np). 


Large  deviations  reflect  more  global  properties  of  the  mask  owing  to  the  presence  of  ||AT||.  The 
subgaussian  coefficient  k  arises  here  because  the  tails  of  the  distribution  drive  the  tails  of  the  error. 
Note  that  the  large-deviation  behavior  only  depends  on  the  individual  components  of  the  random 
vector  being  subgaussian;  we  attribute  this  fact  to  the  basis-dependent  nature  of  the  Schur  product. 
The  logarithmic  factor  in  R  emerges  from  a  truncation  argument,  and  we  believe  it  is  parasitic. 

We  can  obtain  a  sample  complexity  bound  directly  from  the  probability  inequality  (3.4)  in 
Theorem  3.2.  Assume  that  n  <  p  and  that  e  £  (0, 1).  Then 


n> 


lATH^logp  ||M||log> 

o  ' 


AT  0  £n  -  AT  0  £  <  ev2 


(3.6) 


with  high  probability.  The  square  v2  of  the  uniform  fourth  moment  has  the  same  homogeneity 
as  the  covariance  matrix,  so  (3.6)  is  a  type  of  relative  error  bound.  As  before,  the  first  summand 
reflects  the  subgaussian  part  of  the  tail,  while  the  second  summand  comes  from  the  subexponential 
part.  A  novel  feature  of  the  sample  bound  (3.6)  is  the  presence  of  the  ratio  n2 /v2,  which  is  a 
dimensionless  measure  of  the  shape  of  the  distribution.  This  ratio  can  be  very  large  or  very  small, 
so  it  should  be  assessed  within  the  scope  of  a  particular  application. 


3.3.  Specialization  to  Gaussian  Distributions.  It  is  natural  to  apply  Theorem  3.2  to  study 
the  performance  of  masked  covariance  estimation  for  a  zero-mean  Gaussian  random  vector.  In 
this  case,  the  covariance  matrix  determines  the  distribution  completely,  so  we  can  obtain  a  more 
transparent  statement  that  does  not  involve  the  concentration  parameters  n  and  v. 


Corollary  3.3  (Masked  Covariance  Estimation  for  Gaussian  Distributions).  Fix  apxp  symmetric 
mask  matrix  AT.  Suppose  that  x  is  a  Gaussian  random  vector  in  with  mean  zero.  Define  the 
covariance  matrix  E  and  the  sample  covariance  matrix  En  as  in  (1.1)  and  (1.2).  Then  the  expected 
estimation  error  satisfies 


E  M0Sn-M0S  < 


'56  MSI 


ISII  II ATI 


1— >2 


log(6p)  ,  8||S|jmax||AT||log2(6np) 


n 


+ 


n 


Furthermore,  for  each  t  >  0,  the  estimation  error  satisfies  the  tail  bound 


AT  0  XL  —  AT  0  E  >t\ 


<  6 p  ■  exp 


— nt- 


56 


E  E 

Umax  II  I 


I  ATI 


1— >2 


+  16||E| 


|  AT 1 1  log(4np)  •  t 


The  proof  of  Corollary  3.3  appears  below  in  Section  3.4.  Theorem  1.4  of  the  Introduction  follows 
quickly  from  this  result  when  we  apply  the  inequality  ||E||m  <  ||E||  and  complete  some  numerical 

estimates. 
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It  is  fruitful  to  compare  Corollary  3.3  directly  with  earlier  work  on  masked  covariance  estimation 
for  a  Gaussian  distribution.  Assume  that  n  <  p  and  e  £  (0,1)-  Then  Corollary  3.3  delivers  a 
sample  complexity  bound  of  the  form 


n>  C  • 


log p 


M ||  log2p 
£ 


M©S„-M0S  |  <  e  ||E||  (3.7) 


with  high  probability.  The  bound  (3.7)  is  similar  with  the  results  of  Levina  and  Vershynin  [LV11], 
stated  in  (1.13),  but  two  improvements  are  worth  mentioning. 

First,  recall  that  the  sample  complexity  bound  (1.6)  we  present  in  Example  1.1  depends  on 
the  absolute  maximum  entry  of  the  covariance  matrix,  rather  than  its  spectral  norm.  A  similar 
refinement  appears  in  the  bound  (3.7)  on  account  of  the  ratio  of  the  two  norms.  This  ratio  never 
exceeds  one,  and  it  can  be  as  small  as  p~l  for  particular  choices  of  the  covariance  matrix.  We 
interpret  this  term  as  saying  that  covariance  estimation  is  easier  when  the  variables  are  highly 
correlated  with  each  other.  This  represents  a  new  phenomenon  that  previous  authors  have  not 
identified. 

The  second  improvement  over  (1.13),  which  has  less  conceptual  significance,  is  the  reduction  of 
the  number  of  logarithmic  factors. 


3.4.  Proof  of  Corollary  3.3  from  Theorem  3.2.  The  result  for  Gaussian  distributions  is  a 
direct  consequence  of  the  main  theorem  because  the  covariance  matrix  El  of  a  zero-mean  Gaussian 
vector  x  characterizes  the  distribution  completely.  As  a  consequence,  we  just  need  to  estimate  the 
concentration  parameters  k(x)  and  v{x)  in  terms  of  XI. 

First,  we  compute  the  subgaussian  coefficient  k(x).  Observe  that  the  ith  component  X,  of  the 
vector  a;  is  a  Gaussian  random  variable  with  variance  an,  where  an  denotes  the  ith  diagonal  entry 
of  E.  The  usual  Gaussian  tail  bound  demonstrates  that 

P{|W|  >  t}  <  2 e_t2/2cr". 

According  to  Definition  2.1,  the  subgaussian  coefficient  k(W)2  <  ‘Ian,  and  so  the  subgaussian 
coefficient  of  the  vector  satisfies 

k(x)2  <  max*  2a a  =  2  ||E||max  . 

The  latter  equality  holds  because  the  absolute  maximum  entry  of  a  positive-definite  matrix  occurs 
on  its  diagonal. 

Next,  we  compute  the  uniform  fourth  moment  v[x).  Fix  a  unit  vector  u.  The  distribution  of 
the  marginal  u*x  is  Gaussian  with  mean  zero.  To  compute  the  variance  a ^  of  the  marginal,  we 
write  x  =  E1/2^,  where  g  is  a  standard  Gaussian  vector.  Then 

cr2  =  E \u*x\2  =  E  |u*(E1/2g)|2  =  -it*E1/2(Egg*)E1/2w  =  u*T,u  <  ||E||  . 

The  fourth  moment  of  a  Gaussian  variable  equals  three  times  its  squared  variance,  so 

E  \u*x\4  =  3c4  <  3  ||  E || 2  . 

We  conclude  that  the  uniform  fourth  moment  satisfies 

v(x)  =  sup  (E  |w*ik|4)1//4  <  31//4  1 1 E 1 1 1,72  . 

\\u\\=l 

To  complete  the  argument,  substitute  the  estimates  for  k(x)  and  v{x)  into  Theorem  3.2  and 
make  some  numerical  estimates. 


3.5.  Proof  of  Theorem  3.2.  The  argument  follows  the  same  lines  as  the  classical  Laplace  trans¬ 
form  technique.  For  clarity,  we  break  the  presentation  into  discrete  steps. 
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3.5.1.  The  Matrix  Laplace  Transform  Method.  We  begin  with  the  proof  of  the  probability  inequal¬ 
ity  (3.4).  First,  split  the  tail  bound  for  the  spectral  norm  into  two  pieces: 

p|||M0En-M0E||  >t | 

<  IP  {Amax(M  0  Sn  -M0S)>t}  +  P  {Amax(M  0S-M0  Sn)  >  t}  .  (3.8) 

This  inequality  depends  on  the  fact  ||A||  =  rnax{Amax(A),  Amax(— A)},  valid  for  each  self-adjoint 
matrix  A,  and  an  invocation  of  the  union  bound.  We  develop  an  estimate  for  the  first  term  on  the 
right-hand  side  of  (3.8);  an  essentially  identical  argument  applies  to  the  second  term. 

The  matrix  Laplace  transform  bound,  Proposition  2.2,  allows  us  to  control  the  first  term  on  the 
right-hand  side  of  (3.8)  in  terms  of  a  matrix  mgf. 

P  |Amax(M  ©  -  M  ©  S)  >  t }  =  P  |  Amax  (n{M  0Sn-M©S))  >  ntj 

<  inf  |e_e>nt  •  Etr  exp  ( 6n(M  ©  Sn  -  M  ©  £))  }  .  (3.9) 

In  the  first  line  of  (3.9),  we  have  rescaled  both  sides  of  the  event  and  applied  the  positive  homo¬ 
geneity  of  the  maximum  eigenvalue.  Let  us  introduce  notation  for  the  trace  of  the  matrix  mgf: 

E{9)  :=  Etrexp  ( 9n(M  ©  —  M  0  £)^  .  (3.10) 

Our  main  task  is  to  obtain  a  suitable  bound  for  E{9). 

3.5.2.  Symmetrizing  the  Random  Sum.  The  random  matrix  appearing  in  (3.10)  admits  a  natural 
expression  as  a  sum  of  centered,  independent  random  matrices.  To  see  why,  substitute  the  defini¬ 
tions  (1.1)  and  (1.2)  of  the  population  covariance  matrix  El  and  the  sample  covariance  matrix  Xn 
to  obtain 

E(0)  =  Etrexp  (X^-i  0(M  ©  “  E M  ©  xix* ))  • 

The  samples  aq, . . . ,  xn  are  statistically  independent,  so  the  summands  are  independent,  centered 
random  matrices.  Therefore,  we  may  apply  the  symmetrization  lemma,  Proposition  2.3,  to  reach 

E(9)  <  Etrexp  (£”.  L  20£i(M  ©  XiX*)^j  ,  (3.11) 

where  {e,}  is  a  sequence  of  independent  Rademacher  random  variables  that  is  also  independent 
from  the  sequence  {aq}  of  samples.  The  benefit  of  the  estimate  (3.11)  is  that  each  Schur  product 
involves  a  rank-one  matrix,  which  greatly  simplifies  our  computations. 


3.5.3.  Matrix  eg f  Bound  for  the  Matrix  mgf.  The  summands  on  the  right-hand  side  of  (3.11)  are 
i.i.d.,  so  we  can  apply  Proposition  2.4  on  the  subadditivity  of  matrix  cgfs  to  see  that 

E{9)  <  trexp(ra  •  logEexp(26LILf  0  xx*)).  (3.12) 


The  chief  technical  contribution  of  this  paper  consists  in  the  following  matrix  cgf  bound: 


2  '2 


logEexp(2 9eM  Q  xx*)  ^ 


9z(j 


1 


2(l-9R)'I+n'1 


-L 


(3.13) 


where 

cr2  :=  8k2v2  \\M\\1^2  and  R  :=  An2  \\M\\ log(4np).  (3.14) 

The  concentration  parameters  k  and  v  that  characterize  x  are  defined  as  in  (3.1)  and  (3.2).  The 
calculation  underlying  (3.13)  requires  several  pages  and  some  substantial  new  ideas.  We  encapsulate 
the  details  in  Lemma  4.1,  which  is  the  subject  of  Section  4. 
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The  trace  exponential  is  monotone  with  respect  to  the  semidefinite  order  [Pet94,  Prop.  1],  so  we 
can  substitute  the  cgf  bound  (3.13)  into  our  estimate  (3.12)  for  E(9).  Thus, 

m  - tr  exp  ' 1 + 1) =  ep ' exp  )  -  <3J5) 
The  second  relation  depends  on  the  fact  that  the  identity  matrix  has  dimension  p.  The  inequal¬ 
ity  (3.15)  is  just  what  we  need  to  establish  the  probability  inequality  and  the  expectation  bound 
that  constitute  the  conclusions  of  Theorem  3.2. 

3.5.4.  Probability  Bound  for  the  Estimation  Error.  We  are  now  prepared  to  complete  our  bound 
for  the  tail  probability,  initiated  in  (3.8).  Substitute  the  estimate  (3.15)  for  the  matrix  mgf  into 
the  Laplace  transform  bound  (3.9)  to  discover  that 

IP  |Amax  (M  ©  %n  -  M  0  e)  >  t  j  <  ep  •  inf  exp  (^-9nt  +  2 J 

Select  the  classical  value  for  the  parameter:  6  =  t/(a2  +  Rt).  This  choice  yields  an  upper  bound 
for  the  first  term  on  the  right-hand  side  of  (3.8): 

IP  |  Amax  (m  0  £„  -  M  >  t  j  <  ep  ■  exp  •  (3.16) 

The  second  term  on  the  right-hand  side  of  (3.8)  admits  the  same  upper  bound: 

IP  |  Amax  (m  0S-M0  >  i}  <  ep  ■  exp  ^2(o-2  )  '  (3-17) 

The  proof  of  (3.17)  is  essentially  identical  with  the  proof  of  (3.16),  so  we  omit  the  details. 

Finally,  recall  the  definition  (3.14)  for  the  quantities  a2  and  R.  Then  introduce  the  rela¬ 
tions  (3.16)  and  (3.17)  into  the  probability  inequality  (3.8)  to  establish  the  tail  bound  (3.4)  stated 
in  Theorem  3.2. 

3.5.5.  Bound  for  the  Expected  Estimation  Error.  Although  it  is  possible  to  control  the  expected 
error  by  integrating  the  tail  bound  (3.4),  we  obtain  somewhat  better  results  through  a  direct 
application  of  the  estimate  (3.15)  for  the  matrix  mgf  E{0). 

The  argument  is  based  on  the  following  inequality,  of  independent  interest,  which  provides  a 
way  to  bound  the  expected  spectral  norm  of  a  matrix  in  terms  of  its  mgf.  Let  Z  be  a  random, 
self-adjoint  matrix,  and  fix  a  positive  number  9.  We  have  the  following  chain  of  relations: 

E\\Z\\  <  r^logEe011211 

=  9~x  logEemax{A  max  (OZ),  Amax  {-oz) } 

=  e^1  log E max  |Amax(eez),  Amax(e^z)j 

<  9~l  log  ^Etr  eez  +  Etre-0Z^  .  (3.18) 

For  the  first  inequality,  multiply  and  divide  by  9;  then  invoke  Jensen’s  inequality  to  bound  the  ex¬ 
pectation  by  an  exponential  mean.  The  second  relation  expresses  the  spectral  norm  of  a  symmetric 
matrix  in  terms  of  eigenvalues.  In  the  third  line,  we  pull  the  maximum  through  the  exponential 
and  then  apply  the  spectral  mapping  theorem  to  draw  out  the  eigenvalue  maps.  Finally,  replace 
the  maximum  by  a  sum,  and  bound  the  maximum  eigenvalue  of  the  matrix  exponential,  which  is 
positive  definite,  by  the  trace. 

We  intend  to  apply  (3.18)  to  the  random  matrix 

Z  =  n{M  0  Sn  —  M  ©  S). 
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According  to  the  definition  (3.10)  of  the  function  E{6),  the  trace  of  the  mgf  of  the  matrix  Z 
coincides  with  E{6).  Therefore,  when  the  parameter  6  £  (0,  R”1),  our  upper  bound  (3.15)  for  E(0) 
demonstrates  that 

Etr  eez  =  E(9)  <  ep  •  exp  J  .  (3.19) 

The  argument  underlying  the  bound  (3.15)  for  the  trace  mgf  of  Z  also  applies  to  —Z,  whereby 

Etr  e~ez  <  ep  •  exp  Q  J  ^  ^  .  (3.20) 

Introduce  (3.19)  and  (3.20)  into  the  norm  bound  (3.18)  to  reach 

n  •  E  \\M  ©  Sn  -  M  0  S||  <  6~l  ^log(2ep)  +  ' 

Minimize  the  right-hand  side  over  admissible  values  of  8,  ideally  with  a  computer  algebra  system. 
This  computation  yields 

n  •  E  ||M  ©  —  M  ©  S||  <  y/2(j2nlog(2ep)  +  Rlog(2ep). 

Divide  through  by  n  and  recall  the  definition  (3.14)  of  the  quantities  a2  and  R.  Combine  the 
two  logarithms  in  the  second  term  to  complete  the  proof  of  the  expected  error  bound  (3.3)  from 
Theorem  3.2. 


4.  The  Matrix  cgf  of  a  Schur  Product 


In  this  section,  we  work  out  the  details  of  the  matrix  cgf  bound  (3.13)  that  stands  at  the  center 
of  Theorem  3.2.  The  following  lemma  contains  a  complete  statement  of  the  result. 


Lemma  4.1  (Matrix  cgf  Bound  for  a  Schur  Product).  Fix  a  self-adjoint  matrix  M .  Let  x  = 
(Xi, . . . .  Xp)*  be  a  random  vector,  and  let  e  be  a  Rademacher  variable,  independent  from  x.  For 
each  positive  integer  n, 


logEexp(2 OeM  Q  xx *)  =4 


e2a2 

2(1  -  OR) 


when  9  £  (0,  R  x), 


where 

a2  :=  8 k2zz2  ||M||^2  and  R  :=  4k2  ||Af  ||  log(4rap). 

The  concentration  parameters  k  and  v  associated  with  x  are  defined  as  in  (3.1)  and  (3.2). 


To  prove  Lemma  4.1,  we  would  like  to  invoke  the  Bernstein  mgf  bound,  Proposition  2.5,  but 
several  obstacles  stand  in  the  way.  First,  estimating  the  variance  of  the  random  matrix  2 eM  Qxx* 
involves  a  surprisingly  delicate  calculation.  Second,  this  random  matrix  is  typically  unbounded, 
which  requires  us  to  develop  a  new  type  of  truncation  argument.  We  address  ourselves  to  these 
tasks  in  the  next  two  subsections. 


4.1.  Computing  the  Variance.  The  Bernstein  mgf  bound  demands  that  we  compute  the  variance 
of  the  random  matrix  2 eM  0  xx* .  The  following  lemma  contains  this  estimate.  Our  key  insight  is 
that  the  monotonicity  (2.2)  of  the  Schur  product  allows  us  to  replace  one  factor  in  the  product  by 
a  scalar  matrix.  This  act  of  diagonalization  simplifies  the  estimate  tremendously  because  we  erase 
the  off-diagonal  entries  when  we  take  the  Schur  product  with  an  identity  matrix. 

Lemma  4.2  (Semidefinite  variance  bound).  Under  the  assumptions  of  Lemma  j.l,  it  holds  that 

E(2 eM  ©  xx*)2  =4  8/tV  ||M||2^2  •  I. 
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Proof.  First,  we  treat  the  leading  constant  and  the  Rademacher  random  variable. 

E(2£M0m*)2  =  4E(M0xj:’)2.  (4.1) 

The  expectation  with  respect  to  x  is  not  so  easy  to  handle.  To  begin,  we  perform  some  algebraic 
manipulations  to  consolidate  the  remaining  randomness.  The  Schur  product  identity  (2.1)  implies 
that 

(AT  ©  xx*)2  =  (diag (x)M  diag(*))2 

=  diag(ai)(AT  diag(*)2AT)  diag(s)  =  ( AT  diag (x)2M)  ©  xx* . 

Rewrite  the  diagonal  matrix  as  a  linear  combination  of  matrix  units:  diag(tc)2  =  ]U  -  X2  Etl .  The 
bilinearity  of  the  Schur  product  now  yields 

{MQxx*)2  =  AT  X2  Ejj)  AT  Qxx*  =  ^2XMEiiM)Q(X2xx*). 

Take  the  expectation  of  this  expression  to  reach 

E(AT  ©  xx*)2  =  V  (MEaM)  ©  [E(X2  xx*)}.  (4.2) 

Next,  we  invoke  the  monotonicity  (2.2)  of  the  Schur  product  to  make  a  diagonal  estimate  for 
each  summand  in  (4.2): 

(ME„M)  ©  [E(X2xx*)\  4  Amax(E(X2*cc*))  •  (ME..M)  ©I. 

The  Rayleigh-Ritz  variational  formula  [Bha97,  Cor.  III. 1.2]  allows  us  to  write  the  maximum  eigen¬ 
value  as  a  supremum.  Thus, 

Amax(E(X2;E£E*))  =  SUp  U*  \E(X2  XX*)\u  =  SUp  E  \x2  I’ll**]2] 

||u||=l  ||  ix||  =1 

<  sup  (EX4)1/2  (E  l-u**]4)1/2  <  2 n(Xi)2  sup  (E  \u*x\A)1^2  <  2 k2u2. 

||-u||=l  ||ll.||  =  l 

The  first  inequality  is  Cauchy-Schwarz.  For  the  second  inequality,  we  apply  (2.3)  to  bound  the 
fourth  moment  of  Xj  in  terms  of  the  subgaussian  coefficient.  The  final  inequality  follows  from  the 
definitions  (3.1)  and  (3.2)  of  the  concentration  parameters.  Combine  the  last  two  displays  to  obtain 

(MEItM)  ©  [E(X2a;a:*)]  ^  2k  V  •  (MEItM)  ©  I.  (4.3) 

To  complete  our  bound  for  the  variance,  we  introduce  (4.3)  into  (4.2),  which  delivers 

E(M  ©  xx*)2  =4  2kV  •  M2  ©  I 

The  remaining  matrix  is  diagonal,  so  we  can  control  it  using  only  its  maximum  entry: 

E {M  ©  xx*)2  ^  2h2v2  maxj(AT2)jj  •  I  =  2k2v2  ||M"||2^2  •  I 

The  second  relation  follows  from  the  fact  that  the  diagonal  entries  of  M2  list  the  squared  norms 
of  the  columns  of  AT,  and  || iXT|| x computes  the  maximum  column  norm  of  AT.  Substitute  the 
latter  expression  into  (4.1)  to  conclude.  □ 

4.2.  Proof  of  Lemma  4.1.  This  subsection  contains  the  main  steps  in  the  proof  of  Lemma  4.1. 
We  begin  by  explaining  the  motivation  behind  our  approach. 

We  would  like  to  invoke  the  Bernstein  matrix  mgf  inequality,  Proposition  2.5,  to  control  the  mgf 
of  2e AT  ©  xx* .  This  proposition  requires  the  maximum  eigenvalue  of  the  random  matrix  to  satisfy 
an  almost  sure  bound.  Using  the  Schur  product  identity  (2.1),  we  can  develop  a  simple  estimate 
for  the  maximum  eigenvalue: 

Amax(2eAT  ©  XX*)  <  2  ||diag(*)AT  diag(*)||  <  2  ||AT||  ||diag(a;)  ||2  =  2  ||AT||  11*1]^, .  (4.4) 

Unfortunately,  the  random  variable  11*1]^  is  typically  unbounded,  which  suggests  that  we  cannot 
apply  the  Bernstein  approach  directly. 
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To  tackle  this  problem,  we  develop  a  truncation  argument  in  Section  4.2.1,  which  splits  the 
distribution  of  the  random  matrix  2 eM  ©  xx*  into  two  pieces,  depending  on  the  size  of  ||aj||  . 

This  technique  allows  us  to  apply  the  Bernstein  estimate  to  the  bounded  part  of  the  random 
matrix  (Section  4.2.2).  To  handle  the  unbounded  part,  we  use  the  inequality  (4.4)  to  develop  a 
coarse  tail  estimate  that  we  can  integrate  directly  (Section  4.2.3).  Section  4.2.4  combines  these 
results  to  complete  the  argument. 

4.2.1.  The  Truncation  Argument.  As  we  have  explained,  we  intend  to  decompose  the  random  ma¬ 
trix  2 eM  0  xx*  based  on  the  magnitude  of  the  random  variable  11*11^-  To  that  end,  define  the 
event 

**-={\\x\\  1<B},  (4.5) 

where  we  determine  a  suitable  truncation  level  B  later. 

Now,  let  us  split  the  matrix  mgf  into  expectations  over  srf  and  srfc\ 

E  exp(29eM  0  xx* )  =  E  [exp(2 OeM  0  xx* )  1^\  +  E  [exp(2 9eM  0  xx* )  l^c] 

^  Eexp((2 9eM  0  xx*)t^)  +  E  [exp(2 9eM  0  xx*)t^c] .  (4.6) 

The  first  identity  follows  because  the  two  indicators  form  a  partition  of  unity.  In  the  second  line, 
notice  that  the  first  term  can  only  increase  in  the  semidefinite  order  when  we  draw  the  indicator 
t_e/  into  the  exponential. 


4.2.2.  Bernstein  Estimate  for  the  Bounded  Part  of  the  Random  Matrix.  We  can  interpret  the  first 
term  on  the  right-hand  side  of  (4.6)  as  the  mgf  of  a  random  matrix  whose  maximum  eigenvalue  is 
bounded;  this  matrix  mgf  admits  a  Bernstein-type  estimate. 

We  must  verify  that  the  truncated  matrix  (2eM  ©  xx *)!#/  satisfies  the  hypotheses  of  Proposi¬ 
tion  2.5.  First,  note  that 

E[(2 eM  0  xx*)\^\  =  0 

because  the  Rademacher  variable  e  is  independent  from  x  and,  hence,  from  srf .  Second,  continuing 
the  calculation  (4.4),  we  determine  that  the  maximum  eigenvalue  is  bounded. 

Xmax{{2eM  Q  xx*)l^)  <2\\M\\  ||aj||^  •  <  2B\\M\\  .  (4.7) 

The  second  inequality  in  (4.7)  relies  on  the  definition  (4.5)  of  the  truncation  event.  Third,  we  apply 
Lemma  4.2  to  obtain  a  semidefinite  bound  for  the  variance. 

E[(2eM  ©  xx*)!#/]2  =4  E[(2 eM  ©  xx*)2}  =4  8 u2k2  \\M\\x^2  •  I.  (4.8) 

Of  course,  discarding  the  indicator  in  (4.8)  only  increases  the  semidefinite  order. 

In  view  of  (4.7)  and  (4.8),  we  define  a  variance  parameter  and  a  uniform  bound  parameter 

(T2  :=  8k2r2  1 1  .M 1 1  ^2  and  R.:=2B\\M\\.  (4.9) 

Finally,  we  apply  Proposition  2.5  and  the  variance  estimate  (4.8)  to  achieve 

Q2  2 

Eexp((2 9eM  0  xx*)!*)  I  +  dR)  •  I.  (4.10) 


The  relation  (4.10)  is  valid  for  all  9  G  (0,i?  x). 
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4.2.3.  Controlling  the  Unbounded  Part  of  the  Random  Matrix.  We  treat  the  second  term  on  the 
right-hand  side  of  (4.6)  by  making  a  rough  bound  that  we  can  integrate  directly.  First,  observe 
that 

exp(2 OeM  ©  xx*)  ^  exp(20  •  Amax(M'  0  ax*))  •  I  ^  exp(20  \\M\\  Hx]]^)  •  I. 

We  have  applied  the  semidefinite  relation  eA  =4  eAmax^  •  I,  valid  for  each  self-adjoint  matrix  A, 
followed  by  the  eigenvalue  bound  (4.4).  Multiply  both  sides  by  the  indicator  l^c,  and  take  the 
expectation  to  reach 

E[exp(2 OeM  ©  ax*)l^c]  =<!  E[exp(2#  \\M |j  ||*||^0)l,eH  •  I 

=:E[eaWl^c] -I.  (4.11) 


In  the  expression  (4.11),  we  have  abbreviated 

a:=26\\M\\  and  W’:=||x||^0.  (4-12) 

We  apply  classical  techniques  to  bound  the  remaining  expectation. 

Observe  that  we  can  control  the  tail  probability  of  W  using  the  subgaussian  coefficient  k.  Indeed, 

P  {W  >  w}  =  P  | max*  \Xi\2  >  rc| 

<  ^_iP{|X,|2  >  u;}  <  2e-w/K^2  <  2 pe~w/K\  (4.13) 

The  second  relation  is  the  union  bound.  The  third  follows  from  Definition  2.1  of  the  subgaussian 
coefficient  n(X)  of  a  random  variable  X,  while  the  last  depends  on  the  definition  (3.1)  of  the 
subgaussian  coefficient  k  of  the  random  vector  x. 

Next,  we  invoke  a  standard  integration-by-parts  argument  [Bil79,  Eqn.  (21.10)]  to  study  the 
expectation  in  (4.11).  Since  stfc  =  {W  >  B}, 


~K[eaW tj^c]  =eaB  -  ¥{W  >  B}  +  a  eaw  ■  ¥  {W  >  rc}  du; 

Jb 

r  oo 


<  eaB  •  2 pe 


-B/k2 


poo 

a  eaw -2pe-w/K2  dw 

Jb 


=  2  p 


1  + 


a 


1/n2  —  a 


^-(1/k2-o)B 


(4.14) 


We  have  used  the  tail  bound  (4.13)  twice  to  obtain  the  inequality  in  the  second  line.  The  third  line 
follows  when  we  evaluate  the  definite  integral  under  the  assumption  that  a  <  1/k2. 

To  continue  the  bound  on  the  right-hand  side  of  (4.14),  we  need  to  make  a  careful  estimate. 
Owing  to  definition  (4.12)  of  a,  the  condition 


< 


1 


4k2  \\M\ 


a  < 


2k2' 


(4.15) 


Assume  that  6  satisfies  the  hypothesis  of  (4.15).  Now,  observe  that  the  right-hand  side  of  the 
inequality  (4.14)  is  an  increasing  function  of  a.  Therefore,  we  may  increase  a  to  1/2 k2  on  the 
right-hand  side  of  (4.14)  and  then  set  the  truncation  level 

B  =  2k2  log(4?rp)  (4.16) 


to  obtain  the  bound 

E[eQWl^c]  <4pe“s/2K2  =  -. 

n 

Introduce  this  expression  into  (4.11)  to  conclude  that 


E[exp(2 OeM  0  xx*)t^c\  ^ 


n 


(4.17) 
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Finally,  we  verify  that  the  truncation  level  B  forces  the  parameter  9  to  satisfy  the  hypothesis 
of  (4.15).  Recall  the  definition  (4.9)  of  the  bound  parameter  and  the  definition  (4.16)  of  the 
truncation  level  to  see  that 

R  =  2 B  ||Af||  =  4 a2  ||.M||  log(4np). 

We  have  already  assumed  that  6  <  R~l.  It  follows  that 

„  111  1 

R  ~  log(4 np)  '  4a2  \\M\\  ~  4 a2  \\M\\ ' 

This  observation  completes  the  tail  estimate. 


4.2.4.  Combining  the  Results.  We  have  obtained  estimates  for  the  two  terms  in  our  truncation 
bound  (4.6).  Introduce  (4.10)  and  (4.17)  into  (4.6)  to  reach 

02a2  1 

Eexp(2 6eM  ©  xx *)  =4  I  +  — - — —  •  H - I, 

2(1  —  0R)  n 

where  a2  and  R  are  defined  in  (4.9).  We  have  also  assumed  that  9  £  (0 ,  R_1).  The  logarithm  is 
operator  monotone  [Bha07,  Exer.  4.2.5],  so 


logEexp(2 9eM  ©  xx*)  log 


92a2 
2(1  -  R) 


To  complete  the  proof  of  Lemma  4.1,  we  invoke  the  semidefinite  relation  log(I  +  A)  =<I  A.  which 
holds  for  each  positive  semidefinite  matrix  A. 
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